Getting started with polygenic scores
Overview
Over this session we are going to learn about what polygenic scores (PGS) are and how they came to be. Then we will learn how to calculate them and use them in research.
- Background
- Calculating PGS
- Examples of PGS in research
- Problems with PGS
- PGS in practice (using both PRSice2 and PRScs)
Background
Polygenic scores (also called polygenic risk scores) were first introduced around 15 years ago by the International Schizophrenia Consortium with the release of this article.
International Schizophrenia Consortium PGS article
In summary, the researchers performed a genome-wide association study (GWAS) using the International Schizophrenia Consortium (ISC) data. They reported that common genetic variation underlies risk of schizophrenia; most notably, they found more than 450 common variants within the major histocompatibility complex (MHC) region associated with risk for schizophrenia. This was the GWAS component of their research, through which they implicated variation at the MHC locus in schizophrenia risk.
The second part of this study introduced a novel "scoring" technique, an early version of the PRS we know today. The rationale was a concern that the initial GWAS was not sufficiently powered to find the many genetic variants of small effect that exist. By aggregating these effects, the researchers could look at the cumulative risk associated with thousands of SNPs that had not reached genome-wide significance (GWS) in the GWAS. They calculated these PGS and found that i) they were significantly associated with schizophrenia and ii) there was substantial overlap between risk for schizophrenia and risk for bipolar disorder (but not between schizophrenia and non-psychiatric diseases such as diabetes, coronary artery disease, and rheumatoid arthritis).
As demonstrated above, PGS are a method of combining the effects of many SNPs into one risk score. This risk score can be used to test whether genetic variants, specifically alleles of small effect, confer an aggregate risk to a phenotype of interest and then also whether the same sets of variants are shared between cohorts, datasets, and disorders.
Genome-Wide Association Studies (GWAS) are an important research method in themselves; they are also crucial to the calculation of PGS. As a recap, a GWAS explores the genetic data of a large number of people to determine whether any genetic markers are associated with a trait or phenotype of interest. Often this is a binary trait such as schizophrenia presence/absence (a "case-control GWAS"), but traits can be quantitative too (e.g., height, IQ). GWAS can be performed on any trait, provided data are available. Searching the GWAS Catalog gives you a sense of this, with GWAS for knee pain, smoking, spelling ability, household income, and risk-taking behaviour as some examples. Results of a GWAS are provided in the form of summary statistics, which include, amongst other things, a measure of effect size (beta/OR) and a p value for each SNP tested. These sumstats are used in the calculation of polygenic scores.
In layman's terms, a polygenic risk score is a measure of your (common, genetic) risk towards a given disorder or trait. People with high scores are said to have a higher genetic liability towards the phenotype (and so are more likely to develop or have the phenotype), whereas people with lower scores have a lower genetic liability towards the phenotype (and so are less likely to develop or have the phenotype). This video gives a nice overview of the topic.
The actual number of the PGS is not so important - in practice, scores are standardised prior to use so we interpret PGS in units of standard deviation from the mean.
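The standardisation step can be sketched in a few lines. This is an illustrative example with made-up raw scores, using plain z-scoring (subtract the mean, divide by the standard deviation):

```python
from statistics import mean, pstdev

# Hypothetical raw PGS for five participants (made-up numbers)
raw_pgs = [2.02, 1.40, 0.85, 3.10, 1.95]

mu = mean(raw_pgs)
sd = pstdev(raw_pgs)  # population SD; some tools use the sample SD instead

# z-scoring: each score becomes "SDs above/below the sample mean"
z_pgs = [(score - mu) / sd for score in raw_pgs]
print([round(z, 2) for z in z_pgs])
```

After this step the scores have mean 0 and SD 1, so a standardised PGS of 1.5 reads as "1.5 standard deviations above the sample average".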
Image from Centers for Disease Control and Prevention.
Calculating Polygenic Scores
This is the general formula for calculating PGS, where:
- \(s\) = subject (participant)
- \(M\) = the number of SNPs in the dataset
- \(\beta_i\) = effect size (\(\beta\) or \(OR\)) for SNP \(i\)
- \(n_i\) = number of risk alleles for SNP \(i\)
\[ PRS_s = \sum_{i = 1}^{M}\beta_i n_i \]
At minimum, calculating PGS requires two pieces of information:
- summary statistics from a GWAS of your trait of interest.
- genetic data (this must be from a sample separate from the one the GWAS was conducted on)
The most common way to get GWAS summary statistics (aka the discovery dataset) is from an existing GWAS, ideally on a large sample and from a trusted source. For example, if you are calculating polygenic scores for psychiatric traits then you might want to download them from the PGC website. Other places to find GWAS include the GWAS Catalog or the supplemental materials of a paper. Running the GWAS yourself is also a possibility, but it requires you to have two distinct, decently sized samples (which can be more challenging to obtain than one). The per-SNP effect sizes are obtained from the GWAS summary statistics. In contrast, the number of risk alleles per SNP, per subject, is obtained from the genetic information (aka the test dataset).
Overall, this equation means that for every participant in our test dataset, we multiply the number of effect alleles for each SNP by its effect size. This is done for each SNP in the dataset and the products are then summed. After this is done for each participant in our dataset, we standardise the PGS, as previously mentioned.
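The multiply-and-sum step above can be sketched as follows. This is a toy example (genotype counts and betas are made up): rows are participants, columns are SNPs, and each participant's raw score is the sum of allele count times effect size:

```python
# Toy genotype matrix: rows are participants, columns are SNPs
# (values are allele counts 0/1/2); betas come from the GWAS sumstats.
genotypes = [
    [1, 2, 0],  # participant 1
    [2, 0, 1],  # participant 2
]
betas = [0.25, 0.88, -0.5]

def polygenic_score(allele_counts, effect_sizes):
    """Sum of (allele count * effect size) over all SNPs for one person."""
    return sum(n * b for n, b in zip(allele_counts, effect_sizes))

# One raw PGS per participant; these would be standardised afterwards
scores = [polygenic_score(row, betas) for row in genotypes]
print(scores)
```

Real software does the same thing over hundreds of thousands of SNPs, reading the counts from the genotype files and the betas from the sumstats.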
Effect Sizes are an important concept to understand. The most common effect sizes given in GWAS summary statistics are \(\beta\) and \(OR\); they are numbers that capture the strength and direction of the relationship between two variables.
- \(\beta\) is mainly used for continuous traits and is the usual output of a linear regression model (or a quantitative trait GWAS). It may take on a range of values, both negative and positive. Negative numbers reflect a negative relationship, positive numbers reflect a positive relationship, and a \(\beta\) of 0 reflects no relationship between the two variables.
- \(OR\) is mainly used for categorical traits and is usually given as the result of a logistic regression (or case-control GWAS). All odds ratios will be 0 or above. An \(OR\) of 1 represents no effect of the SNP on the outcome, \(OR\) < 1 reflects a decreased chance of the outcome, and \(OR\) > 1 reflects an increased chance of the outcome.
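A common convention when scoring from case-control sumstats is to convert ORs to the log scale first, so that "no effect" maps to 0 and effects become additive. A minimal sketch with made-up ORs:

```python
import math

# Hypothetical ORs from a case-control GWAS (made-up values)
odds_ratios = {"rs234567": 1.20, "rs345678": 0.80, "rs456789": 1.00}

# ln(OR): ln(1) = 0 (no effect), risk alleles (OR > 1) become positive
# weights, protective alleles (OR < 1) become negative weights.
log_betas = {snp: math.log(or_) for snp, or_ in odds_ratios.items()}
print(log_betas)
```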
Worked Example
These are examples of what your target (genetic) and discovery (sumstats) data will look like. The genotype data will often be in the form of plink format .bim/.bed/.fam files, which cumulatively give one half of the information required for PGS calculation. This is illustrated below (note, this is a representation of that information - the files do not look like this).
| ID | rs123456 | rs234567 | rs345678 | … | rs456789 |
|---|---|---|---|---|---|
| Indv_1 | 1 | 2 | 0 | … | 1 |
| Indv_2 | 2 | 0 | 1 | … | 0 |
| Indv_3 | 0 | 1 | 0 | … | 2 |
| … | … | … | … | … | … |
| Indv_100 | 1 | 2 | 1 | … | 1 |
An abridged version of GWAS sumstats is illustrated below, containing effect sizes and p values for each SNP in the GWAS. GWAS summary statistics often come in different formats with slight differences between them. This is fine, and only requires a little bit of manipulation before they can be used in PRS calculation (most of this can be done in the command line on the HPC).
| SNP | Effect Size (β) | p value |
|---|---|---|
| rs123456 | 0.25 | 0.560 |
| rs234567 | 0.88 | 0.004 |
| rs345678 | -0.5 | 0.990 |
| … | … | … |
| rs456789 | 0.01 | 0.0001 |
Therefore, to calculate a PGS for Indv_1, the equation would look like this:
\(PRS_{Indv\_1} = (n_1 \times \beta_1) + (n_2 \times \beta_2) + (n_3 \times \beta_3) + \dots + (n_M \times \beta_M)\)
\(PRS_{Indv\_1} = (1 \times 0.25) + (2 \times 0.88) + (0 \times -0.5) + \dots + (1 \times 0.01)\)
\(PRS_{Indv\_1} = (0.25) + (1.76) + (0) + \dots + (0.01)\)
This is then done for all individuals in the dataset:
\(PRS_{Indv\_1} = (n_{1,1} \times \beta_1) + (n_{2,1} \times \beta_2) + (n_{3,1} \times \beta_3) + \dots + (n_{M,1} \times \beta_M)\)
\(PRS_{Indv\_2} = (n_{1,2} \times \beta_1) + (n_{2,2} \times \beta_2) + (n_{3,2} \times \beta_3) + \dots + (n_{M,2} \times \beta_M)\)
\(PRS_{Indv\_3} = (n_{1,3} \times \beta_1) + (n_{2,3} \times \beta_2) + (n_{3,3} \times \beta_3) + \dots + (n_{M,3} \times \beta_M)\)
\(PRS_{Indv\_n} = (n_{1,n} \times \beta_1) + (n_{2,n} \times \beta_2) + (n_{3,n} \times \beta_3) + \dots + (n_{M,n} \times \beta_M)\)
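The arithmetic in the worked example above can be checked directly. This sketch only uses the four SNPs visible in the tables (the "…" SNPs are elided), so it is a partial sum:

```python
# Indv_1's allele counts for rs123456, rs234567, rs345678, rs456789
allele_counts = [1, 2, 0, 1]
# Effect sizes for the same SNPs from the sumstats table
betas = [0.25, 0.88, -0.5, 0.01]

# Partial PRS over the four SNPs shown: 0.25 + 1.76 + 0 + 0.01
partial_prs = sum(n * b for n, b in zip(allele_counts, betas))
print(round(partial_prs, 2))  # -> 2.02
```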
Fortunately, this is not done by hand. There is a range of software packages to help you do this, most of which have their own slightly different method of calculation. This is because there are several problems inherent to PGS, and each package tries to get around them in a different way.
(Key) Assumptions of Polygenic Scores
As with most statistical methods, there are several assumptions that are made when we calculate polygenic scores. Two of the most important are that of independent SNPs and additivity.
- Independent SNPs - controlling for linkage disequilibrium (LD) by removing SNPs in high LD retains only independent SNPs in your analysis. We do this because we only want independent predictors of a trait. In a GWAS, LD can produce false positives (seen as a strong peak of association): SNPs that appear associated with a trait but are merely in LD with a true causal SNP. If we did not correct for LD, we would overestimate the effects of common genetic variants on our trait of interest, essentially counting the effect of the true causal SNP multiple times (as many times as there are SNPs in LD with it). If PGS are calculated from independent SNPs, the distribution of scores should be close to normal.
Visualising LD. Figure from DOI: 10.1093/jxb/eraa480
- Additivity - this assumes that the effects of SNPs are independent of each other, so that the effects of all these common variants are purely additive. This is a theoretically convenient idea and allows us to calculate PGS by adding up \(effect \times n\) across all SNPs. However, it is unlikely to hold exactly in practice due to interactions between genetic variants. Additivity simplifies the analysis, but it is important to remember that PGS will not account for any multiplicative effects or interactions (i.e., PGS = additive, common genetic liability).
Linkage disequilibrium arises when the combinations of alleles we observe differ from the combinations we would expect under random assortment. In other words, we see correlations between genotypes at different SNPs, with these blocks of linked SNPs referred to as haplotypes. Linkage generates LD because SNPs in nearby regions of the genome are usually inherited together. Recombination breaks down LD: it occurs during meiosis and mixes maternal and paternal material so that the resulting sex cells contain combinations of the two. These crossover events create new allele combinations.
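LD between a pair of SNPs can be quantified with the textbook statistics \(D\) and \(r^2\). A toy illustration with made-up haplotype and allele frequencies:

```python
# Made-up frequencies for two SNPs
p_AB = 0.50   # frequency of the A-B haplotype
p_A = 0.60    # frequency of allele A at SNP 1
p_B = 0.70    # frequency of allele B at SNP 2

# D: observed haplotype frequency minus the frequency expected
# if the two SNPs assorted independently.
D = p_AB - p_A * p_B

# r^2: squared correlation between the loci; 0 = independent, 1 = fully linked
r2 = D**2 / (p_A * (1 - p_A) * p_B * (1 - p_B))

print(round(D, 3), round(r2, 3))
```

An \(r^2\) threshold (e.g., 0.2) computed this way is what pruning and clumping use to decide whether two SNPs count as "correlated".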
Two methods can be used to deal with linkage disequilibrium:
- Pruning uses the first SNP (in physical genome order) and computes its correlation with the following SNPs within the specified window size. When a large correlation (above a specified threshold) is found, it removes one SNP from the correlated pair, keeping the one with the largest minor allele frequency. This continues with the next SNP that has not yet been removed. In a worst-case scenario, this algorithm may remove all but one SNP from the data.
- Clumping is considered a more "intelligent" way of removing SNPs in LD. First, SNPs are sorted by importance, often using a statistic of some sort (e.g., effect sizes, p values, or MAF). It takes the first SNP (i.e., the most important SNP by whatever metric is used) and removes SNPs that are highly correlated with it within the given window. It then moves on to the next most significant SNP that has not been removed. Using this method, there will always be at least one SNP representing the region covered by the window.
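The greedy clumping procedure described above can be sketched on toy data. The SNP names, p values, and pairwise \(r^2\) values here are all made up; real software computes the correlations from the genotype data within a base-pair window:

```python
# Hypothetical p values: SNPs are visited from most to least significant
pvals = {"rs1": 1e-8, "rs2": 1e-5, "rs3": 0.03, "rs4": 0.20}

# Hypothetical pairwise r^2 within the clumping window (symmetric)
r2 = {("rs1", "rs2"): 0.9, ("rs1", "rs3"): 0.1, ("rs1", "rs4"): 0.05,
      ("rs2", "rs3"): 0.2, ("rs2", "rs4"): 0.1, ("rs3", "rs4"): 0.85}

def correlated(a, b, threshold=0.5):
    return r2.get((a, b), r2.get((b, a), 0.0)) > threshold

kept = []
for snp in sorted(pvals, key=pvals.get):      # most significant first
    # drop any SNP highly correlated with an already-kept index SNP
    if not any(correlated(snp, idx) for idx in kept):
        kept.append(snp)                      # new index SNP for its region

print(kept)
```

Here rs2 is dropped because it is in high LD with the more significant rs1, and rs4 is dropped because of rs3, leaving two independent index SNPs.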
Using Polygenic Scores
Polygenic risk scores represent the polygenic (common genetic variation) component associated with a given phenotype (disease, disorder, trait).
Often, we use these scores in regression models to test associations between this common genetic liability and a given outcome variable. Generally, we use linear regressions for continuous outcomes and logistic for binary outcomes. However, there are lots of other types of regression too.
These are some examples of associations we might test. What kind of model might we use for:
- Educational outcomes measured as degree (yes/no)
- Experience of medication side effects (yes/no)
- Income (yearly income)
- Educational outcomes measured as the most advanced level of education (GCSEs, A-levels, UGR, PGR, etc.)
Polygenic Scores in the Literature
Background
The paper below is a good example of using polygenic scores in research, here the authors look at associations between schizophrenia PGS and measures of schizophrenia symptom severity.
Excerpt from Legge et al., (2021)
Methods
The study uses three different samples, all collected by Cardiff University and all consisting primarily of individuals with schizophrenia or schizoaffective disorder, depressed type. Schizophrenia and schizoaffective disorder depressed type are often included together to help improve sample size and thus statistical power. This is generally fine as the disorders are fairly similar both in their symptoms and in how they're treated.
This study looks at lifetime-worst schizophrenia symptom severity and cognition. The raw data for these variables were measured with a variety of tests, including the SAPS and SANS (Scales for the Assessment of Positive and Negative Symptoms) and the MCCB (MATRICS Consensus Cognitive Battery). The MCCB measures cognition across 7 domains (using 10 tests): speed of processing, attention, working memory, verbal learning, visual learning, reasoning/problem solving, and social cognition. Based on these individual measures, a confirmatory factor analysis was used to derive 5 phenotype dimensions relating to lifetime-worst symptom severity: positive symptoms, negative symptoms of diminished expressivity, negative symptoms of reduced motivation and pleasure, disorganised symptoms, and cognitive ability.
All individuals were genotyped using the Illumina OmniExpress v8/v12 array.
The methods, particularly with respect to the genetics, are very clear. Briefly:
A principal component analysis (PCA) was conducted so that the resulting genetic principal components could be used to exclude people of non-European ancestry and to control for population stratification in later regression models.
> "Genetic principal components representing ancestry were derived using PLINK, version 2.0, using single-nucleotide variants with low levels of linkage disequilibrium (r2 < 0.2 and 500-kilobyte window; criteria used by the Psychiatric Genomics Consortium)" Legge et al., (2021).
Genotype data were prepared to remove SNPs in high LD.
> "Criteria used by the Psychiatric Genomics Consortium were applied to select common single-nucleotide variants (minor allele frequency, >0.10) of high quality (imputation score, >0.9) in relative linkage equilibrium (r2 < 0.2, 500-kilobyte window) that were present in all 3 samples" Legge et al., (2021).

PGS were calculated using PRSice and sumstats from the PGC3 schizophrenia GWAS.
> "Polygenic risk scores were calculated using PRSice following a widely applied method and using default parameters" Legge et al., (2021).
Results
Table 1 from Legge et al., (2021) showing association of schizophrenia PRS and phenotype dimensions from the 5-factor model.
Five regression models were fit; in each, a schizophrenia phenotype dimension was the outcome variable, regressed against schizophrenia PGS alongside age, sex, and the first five genetic principal components. For example:
positive ~ PGS + age + sex + PC1 + PC2 + PC3 + PC4 + PC5.
The output table can be interpreted by examining the effect size (\(\beta\)), confidence intervals, p values, and the \(R^2\). The \(R^2\) represents the proportion of variance in each phenotype attributable to the variables in the model. The effect size shows the strength and direction of the relationship between schizophrenia PGS and each symptom dimension.
These were corrected for multiple comparisons using the Bonferroni correction. Following this, only the disorganised and cognitive dimensions remained significantly associated with schizophrenia PGS.
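The Bonferroni correction here is simple arithmetic: with five tests, the nominal alpha of 0.05 is divided by 5. The p values below are invented for illustration (not the values reported by Legge et al.), chosen so that the same two dimensions survive correction:

```python
# Bonferroni: divide the nominal alpha by the number of tests
alpha = 0.05
n_tests = 5
adjusted_alpha = alpha / n_tests  # 0.01

# Hypothetical p values for the five phenotype dimensions (made up)
p_values = {"positive": 0.04, "neg_expressivity": 0.30,
            "neg_motivation": 0.12, "disorganised": 0.002,
            "cognitive": 0.0001}

significant = [dim for dim, p in p_values.items() if p < adjusted_alpha]
print(significant)
```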
Discussion
In summary, schizophrenia PGS is associated with some schizophrenia symptom dimensions. Increased schizophrenia PGS was associated with increased scores on the disorganised symptom dimension (i.e., greater severity of lifetime-worst disorganised symptoms). On the other hand, increased schizophrenia PGS was associated with a decreased score on the cognitive dimension (reflecting reduced cognitive performance). These cognitive and disorganised phenotypes could be used as markers of schizophrenia common genetic liability.
Applications and Limitations
Applications
Risk stratification
One of the hopes for polygenic scores is that we should be able to use PGS to identify people who are at higher or lower genetic risk of certain disorders. This could then be used to help target mitigation strategies (i.e., prevention or screening) towards more at-risk groups. For example, if you had calculated PRS for risk of coronary artery disease, then you could focus resources on the high-risk group, encouraging them to engage in lifestyle interventions such as physical exercise and improved diet. Unfortunately, this remains a hope for the future; we are not yet at the stage where PGS can provide evidence-based recommendations in clinical settings. One reason is that PGS only capture common genetic liability, and genetics is rarely the whole story, with environmental factors also affecting the phenotype. Secondly, polygenic scores only capture genetic risk from a certain part of the genome: common variants of small effect. They do not take into account risk from other sources, for example genetic variation that carries a greater risk burden but is rarer in the population, such as CNVs.
Improved diagnosis/prognosis
Previously, for some disorders, single mutations of large effect have been used to identify people at greater risk of disorder or disease. However, it turns out that when aggregated into PRS, the SNPs of small effect that we capture can have impacts comparable to, or greater than, those single mutations. Therefore, we can identify a greater proportion of at-risk people using these scores than with these rare monogenic factors alone. An example of this is risk prediction in breast cancer.
Ordinarily, the genes BRCA1 and BRCA2 protect against the development of certain cancers (e.g., breast, ovarian, and others). However, some people have mutations in these genes that prevent them from working as they should, and these people can be at higher risk of developing breast cancer. As well as screening for specific variants in these genes, it is also possible to employ PRS to improve predictive ability. Based on a GWAS of breast cancer, researchers found that people in the top 1 percentile of breast cancer PRS had a 4-fold greater risk of developing a certain type of breast cancer (ER+) and those in the bottom 1 percentile had a 6-fold reduced risk. In this way, by looking at the entire genome (not just within the BRCA genes), you may be able to get a more comprehensive idea of breast cancer risk (Mavaddat et al., 2019).
It is also much easier to ascertain PRS than these specific mutations, because it only requires a genotyping array. In contrast, looking at rarer variants may require sequencing specific parts of the genome. Sequencing is less accessible than genotyping alone, which costs only around $100 and a blood or saliva sample per person (Khera et al., 2018). Genotyping results can then be used to calculate PRS for as many disorders or phenotypes as you like, whereas the monogenic approach requires specifically assaying mutations in the genes relevant to each disease of interest.
Understanding the disorder
Pleiotropy is the phenomenon whereby a single gene influences two or more different phenotypes. PRS can be a useful tool for examining the extent of pleiotropy across phenotypes or psychiatric disorders, as it allows us to test genetic overlap, in other words to identify shared genetic underpinnings of certain disorders.
Treatment development
It is hoped that polygenic risk scores might one day be incorporated into clinical decision-making processes, specifically in relation to developing personalised treatment plans for patients. This might include assigning more aggressive treatment plans to people with higher genetic liability than to those with lower genetic risk. That being said, hopes are primarily rooted in PRS being used for the early detection/prevention/intervention components of precision medicine. GWAS results are a useful piece of information to help guide general drug development; however, we just can't yet translate them via PGS to an individualised level.
Limitations of PGS
Datasets
One "limitation" is that you need two datasets to calculate PGS: the individual genotype data for your sample, and a dataset on which a GWAS was performed to provide the summary statistics. Using the same sample for your training (GWAS) and target (genotype) data risks the PGS model only working well on that one dataset, when what we really want is for it to generalise to new and unseen data too. In other words, we want to see how well those SNPs predict a given phenotype in people outside the original sample. In practice, this is not really a problem thanks to the availability of public GWAS summary statistics: you don't need an in-house second dataset, only an appropriate GWAS that fits your work (which can usually be sourced online fairly easily).
Ancestry
People of European descent make up about 16% of the world's population, yet they are massively over-represented in GWAS research, both in the number of studies and in the number of participants. As we know, a combination of GWAS and genotype-level data is used in PGS calculation, and the GWAS sample's ancestry needs to match that of the individual-level data. This means that if we use these large, majority-European GWAS summary statistics for people of non-European ancestries, the capacity for prediction will not be as good as if the ancestries matched.
Figure showing GWAS study/participants ancestry from https://doi.org/10.1016/j.cell.2019.02.048.
Fortunately, the field of genetics is moving forward and there is renewed effort to build large genetic samples of non-European participants. One example is the Biobank of the Americas, which focuses on sampling people from under-represented populations, particularly Latin American ethnic groups. Despite the many people living in these countries, less than 1% of genetic samples come from Latin America. This project aims to actively recruit participants from these areas, providing free genetic testing kits, in order firstly to better understand genetic contributions to disorder and disease across populations, and secondly to combat the widening health inequalities that would arise if data from these communities were not integrated into our studies. Another point to note is that algorithms for PRS calculation are now aiming to be more flexible with regard to sample ancestry. For example, PRScs, one of the programs we'll have a go with later, has another version called PRScsx which aims to better handle cross-population prediction.
Ethics
Genetic discrimination is the idea that genetic information (e.g., PGS) may be used to "justify" discrimination against people. For example, if employers got hold of this information they may attempt to employ "healthier" individuals less likely to take sick days, and likewise health insurance providers might be less likely to insure people with high-risk scores for disorders or diseases, or indeed attempt to charge higher monthly costs.
Another example is the use of PGS to screen embryos prior to implantation, something that the European Society of Human Genetics (ESHG) denounced as an "unproven, unethical practice". The ESHG also warned that "When PRS assessments are provided as direct-to-consumer tests their evaluation of a patient's risk may be dangerously incomplete and can lead to grave misunderstandings."
Despite this, very recently a private health company started offering fertility clinics whole-genome sequencing of IVF embryos at $2.5k/embryo. To clarify, this is not just looking at single-gene variants (for example within the CFTR gene to test for cystic fibrosis) but at the entire genome, for the primary purpose of calculating PRS to "find the embryo at lowest risk for the disease that runs in your family". As mentioned, PRS do not reflect the sole risk factor involved in disorder/disease development, nor even the sole genetic risk factor. The whole idea also veers very much into eugenics territory.
The PGC has some of the biggest collections of GWAS summary statistics that could be used in analyses such as these, despite such use very much violating their terms of use. This story prompted swift backlash from the PGC, who reiterated that the goal of all their genetics research is to improve the lives of people with mental illness, not to stop them from being born. Unfortunately, while they can offer guidance on how their work should be used, there are no laws in place to prevent similar unethical practices from taking place. It is therefore incredibly important to be aware of the potential consequences and interpretations of any genetics research you take part in.
Overfitting and Linkage
This will be covered in greater detail in the practical components, as the two software packages deal with these issues in different ways. As a brief summary:
- Overfitting is the idea that the effect sizes we see in GWAS sumstats for the top hits may be larger than the true effects.
- Linkage is the idea that some SNPs are preferentially inherited together and so exhibit linkage disequilibrium (LD), making them non-independent.
PRSice Practical
Background
PRSice has been one of the most popular methods of calculating polygenic scores. It relies on a method known as clumping and thresholding (C & T) to solve some of the problems inherent to PGS.
- Clumping is used as we want independent SNPs. For a recap of what clumping is, see above.
- Thresholding is used to maximise the predictive ability of scores.
  - PGS came about from the understanding that common SNPs that do not reach GWS may still have predictive value. The p value thresholds are therefore cut-offs for the number of SNPs we include in the PGS. A less stringent threshold means more SNPs are included in the calculation (all of them in the case of p = 1); a more stringent threshold means fewer SNPs are included.
  - There is no universally optimum threshold; the best threshold will vary for a given phenotype based on what we know about its genetic architecture.
    - If a trait is highly polygenic, lots of common variation contributes to the phenotype, so a higher p value threshold will often be selected to include more of these SNPs.
    - If a trait is more oligogenic, a more stringent threshold will be selected, as not as many variants are involved.
  - Including too many or too few SNPs can both be detrimental: you risk introducing too much noise or too few SNPs into your variable, respectively. However, threshold selection is not just theory-driven: PRSice calculates many PGS at different thresholds, and the best-fit score (the one most predictive of the phenotype, i.e., with the largest \(R^2\)) is usually selected for further analysis.
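The C+T selection loop can be sketched on toy data. All values here are made up, and plain Pearson correlation stands in for the model \(R^2\) that PRSice reports:

```python
# Toy sketch of C+T threshold selection: build a score at each p value
# threshold and keep the threshold whose score best predicts the phenotype.

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

snps = [  # (beta, p value, allele counts for 4 individuals) - made up
    (0.5, 1e-6, [2, 1, 0, 2]),
    (0.3, 0.004, [1, 2, 0, 1]),
    (0.1, 0.30, [0, 1, 2, 0]),
]
phenotype = [2.1, 1.6, 0.2, 1.9]

best_threshold, best_r2 = None, -1.0
for t in [1e-5, 0.05, 1.0]:
    # keep only SNPs whose GWAS p value passes this threshold
    included = [(b, counts) for b, p, counts in snps if p < t]
    scores = [sum(b * c[i] for b, c in included) for i in range(4)]
    r2 = pearson(scores, phenotype) ** 2
    if r2 > best_r2:
        best_threshold, best_r2 = t, r2

print("best threshold:", best_threshold)
```

In this toy example the middle threshold wins: the strictest cut-off discards a genuinely informative SNP, while p = 1 lets in a noisy one.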
Practical
The PRSice2 practical will involve calculating PGS for Alzheimer's disease (AD). Please see here for a guide to calculating PGS from SW Choi, the author of the PRSice package.
1. Set up file structure and copy files to the appropriate directories:
Files can be obtained from
/nfs/neurocluster/databank/Teaching/MET587/PRS on HAWK.
This file structure will serve both the PRSice2 and PRScs practicals (those that are crossed out aren't relevant to this part of the practical; however, it's still worth bringing them across):
PRS
├── data
│   ├── validation.bim/.bed/.fam
│   ├── training.assoc.logistic
│   ├── PGC3_sz_sumstats.txt
│   ├── high-ld-b37.ranges
│   └── ldblk_1kg_eur
├── Code
├── Results
│   ├── pca
│   ├── prsice
│   └── prscs
├── PRScs
├── ve_prscs
└── code_examples
The code_examples directory has 3 example scripts in it, corresponding to different parts of the practical:
- 1_generate_pcs.sh
- 2_runprsice.sh
- 3_run_prscs.sh
2. GWAS selection
For this practical the GWAS is already provided (training.assoc.logistic). The sumstats are from a case-control GWAS of AD, conducted on a different sample from our genotype files but of the same ancestry. As mentioned before, if no sumstats files are provided then you can usually find suitable data in public repositories.
Before continuing, check that the reference alleles match between the GWAS and genotype files. This can be checked with grep rsid file.name on the command line. This allows us to check whether A1 in both files refers to the same allele; we want A1 in the GWAS to match A1 in the .bim file (which usually refers to the minor allele). If there is a mismatch, then the 5th and 6th columns of the .bim file need to be swapped.
3. Calculate Genetic Principal Components
You must control for population stratification in analyses using PGS; we do this by including genetic principal components in our analysis (i.e., 1_generate_pcs.sh). First up is the initial QC:
# initial qc
module load plink/2.0
plink2 --bfile ../data/validation --maf 0.1 --geno 0.95 --mind 0.95 --hwe 0.000001 midp keep-fewhet --make-bed --out ../data/validation.qc
- --maf 0.1 -> restricts to SNPs with a minor allele frequency above 10%.
- --geno 0.95 -> removes SNPs with a missing genotype rate above the threshold.
- --mind 0.95 -> removes individuals with a missing genotype rate above the threshold.
- --hwe ... midp -> filters out variants with a Hardy-Weinberg exact test p value below the threshold (0.000001). The midp modifier is recommended (see the plink documentation for more information).
- keep-fewhet -> prevents the test from failing on normal variants when population stratification is present (see the plink documentation for more information).
Then, remove SNPs in LD. First, identify and then exclude those in regions of high LD using the .ranges file:
module load plink/1.9
plink --bfile ../data/validation.qc --make-set ../data/high-ld-b37.ranges --write-set --out ../data/temp/validation.noLD
module purge
module load plink/2.0
plink2 --bfile ../data/validation.qc --exclude ../data/temp/validation.noLD.set --make-bed --out ../data/temp/validation.noLD
- --make-set -> defines a SNP list based on base-pair regions
- --write-set -> writes out this SNP list
Then, we prune to identify and remove the remaining SNPs in LD:
plink2 --bfile ../data/temp/validation.noLD --indep-pairwise 500 10 0.2 --out ../data/temp/validation.noLDp
- --indep-pairwise -> identifies SNPs not in LD using a window size of 500, a step size of 10, and an r² threshold of 0.2.
plink2 --bfile ../data/temp/validation.noLD --extract ../data/temp/validation.noLDp.prune.in --make-bed --out ../data/temp/validation.noLDp
- --extract -> keeps only the SNPs retained by the pruning step above.
Then, calculating principal components:
plink2 --bfile ../data/temp/validation.noLDp --pca 10 --out ../data/temp/validation.noLDp
- pca → command for running principal component analysis; the number is how many principal components you want returned (the default is 10).
4. Calculate PGS
Then we want to use PRSice2 to calculate our polygenic scores (i.e., as in 2_run_prsice.sh):
module load prsice
PRSice.sh --base data/training.assoc.logistic --target data/temp/validation.qcp --thread 1 --stat BETA --binary-target T --no-regress --bar-levels 0.00001,0.0001,0.001,0.01,0.05,0.1,0.5,1 --fastscore --out results/PRSice
- base → the sumstats file
- target → the genotype file
- stat → the effect size column used in the sumstats file
- binary-target → whether the target phenotype is binary or not (T/F)
- no-regress → we just want the scores for now
- bar-levels → sets the p value thresholds of interest
- fastscore → calculates scores only at these thresholds
Note: if your .bim/.bed/.fam files are in a per-chromosome format, you can either merge them all together first, or replace the chromosome number in the file name with a #, and PRSice will run all of them for you in one command.
Output files include a .log file, a .prsice file, and an .all.score file. The .log is good to inspect to ensure your code ran as expected. The .prsice file would contain results from regression analyses (however, we specified --no-regress). The .all.score file contains scores for our thresholds of interest. Not specifying thresholds / using --fastscore and instead using --all-score will result in a much larger file being produced.
Note: if this is run per chromosome then you will get one file per chromosome. The *.all.score files need to be summed before use in analyses.
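That summation is just per-individual addition of the per-chromosome totals. A minimal sketch in Python (the IDs and scores here are made up for illustration; in practice the values come from the per-chromosome *.all.score files):

```python
# Hypothetical per-chromosome score tables, keyed by individual ID.
chr_scores = [
    {"id1": 0.10, "id2": -0.02},  # chromosome 1
    {"id1": 0.03, "id2": 0.05},   # chromosome 2
]

# Sum across chromosomes to get one genome-wide score per individual.
total = {}
for scores in chr_scores:
    for iid, score in scores.items():
        total[iid] = total.get(iid, 0.0) + score

print({iid: round(s, 2) for iid, s in total.items()})  # {'id1': 0.13, 'id2': 0.03}
```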
5. Use PGS in your analyses
Here is an example of how we might use PGS. This code loads in your PGS generated with PRSice2 and finds the best-fit PGS. Then, we use this score to test whether it is associated with the disease phenotype using a logistic regression model.
# set up
library(readr)
library(tidyverse)
library(datawizard)
library(ggplot2)
library(performance)
# set your working directory (where you stored your PRSice results)
setwd(dir = "D:/siobh/Documents/Uni/PhD/Y2/Other Stuff/Teaching/MET587 PRS/results/")
# load in the .all.score file (PRSice output)
prsice.out <- read.table("prsice/PRSice.all.score", header = TRUE, sep= "") %>%
dplyr::select(-"FID")
colnames(prsice.out) <- c("IID", "prs_1e-5", "prs_1e-4", "prs_0.001",
"prs_0.01", "prs_0.05", "prs_0.1", "prs_0.5", "prs_1")
# validation file - to get the phenotype of interest
validation.fam <- read.table("validation.fam", header = FALSE, sep= "")
colnames(validation.fam) <- c("FID", "IID", "FIID", "MIID", "SEX", "PHENO")
# selecting only the columns we are interested in
validation.fam <- validation.fam %>% dplyr::select(c("IID", "SEX", "PHENO"))
# fam files coded 1 = control, 2 = case
# so recode pheno to 0/1 for logistic regression
validation.fam$PHENO[validation.fam$PHENO == 1] <- 0
validation.fam$PHENO[validation.fam$PHENO == 2] <- 1
# load in the principal components - but only reading in first 5 PCs
pcs <- read.table("validation.noLDp.eigenvec", header = FALSE, sep= "")[2:7]
colnames(pcs) <- c("IID", "PC1", "PC2", "PC3", "PC4", "PC5")
# merge data together (fam file and pcs)
df <- left_join(validation.fam, pcs, by = "IID")
As we have a binary outcome, we cannot directly calculate R2. Instead we need to use a pseudo-R2 such as Nagelkerke's R2. If we had a quantitative outcome variable, R2 would be calculated as part of a standard linear model.
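For reference, Nagelkerke's R2 rescales the Cox & Snell R2 by its maximum attainable value, using only the log-likelihoods of the null and fitted models and the sample size. A minimal sketch (the log-likelihood values below are made up for illustration):

```python
import math

def nagelkerke_r2(ll_null: float, ll_full: float, n: int) -> float:
    """Nagelkerke's pseudo-R2 from model log-likelihoods."""
    cox_snell = 1 - math.exp((2 / n) * (ll_null - ll_full))  # Cox & Snell R2
    max_r2 = 1 - math.exp((2 / n) * ll_null)                  # its maximum value
    return cox_snell / max_r2

# hypothetical log-likelihoods for a null model and a null+PRS model
print(round(nagelkerke_r2(ll_null=-2190.2, ll_full=-2185.7, n=5808), 4))  # 0.0029
```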
# Finding best fit PRS (the one with the largest R2
# aka the one that explains the most variance in the phenotype)
# First, we calculate the null model (model without PRS, only covariates)
# using a logistic regression (as phenotype/case-control status is binary outcome)
null.model <- glm(PHENO~., data=df[,!colnames(df)%in%c("IID")],
family = binomial(link = "logit"))
# You cannot really get an R2 from a logistic regression
# in the same way that you can from a linear model.
# Therefore we use the performance package's r2_nagelkerke() function to get a pseudo-R2 instead.
null.r2 <- r2_nagelkerke(null.model)
# loop through each PRS finding the variance explained by each PRS (model R2 with PRS - model R2 without PRS)
# save results into prs.result_ice
prs.result_ice <- NULL
for(i in 2:ncol(prsice.out)){
# merge the relevant PRS from the prsice.out file
pheno.prs <- merge(df, prsice.out[,c(1,i)], by= "IID")
# Now perform a logistic regression on the FAM phenotype with PRS and the covariates
# ignoring the ID variable from our model
model <- glm(PHENO~., data=pheno.prs[,!colnames(pheno.prs)%in%c("IID")],
family = binomial(link = "logit"))
# model pseudo-R2 is obtained as
model.r2 <- r2_nagelkerke(model)
# R2 of PRS is simply calculated as the model R2 minus the null R2
prs.r2 <- model.r2-null.r2
# We can also obtain the coefficient and p-value of association of the PRS as follows
# (the PRS is the last predictor in the model, so we take the last row of the table)
prs.coef <- summary(model)$coeff[nrow(summary(model)$coeff),]
prs.beta <- as.numeric(prs.coef[1])
prs.se <- as.numeric(prs.coef[2])
prs.p <- as.numeric(prs.coef[4])
# We can then store the results to make a nice output data frame
prs.result_ice <- rbind(prs.result_ice,
data.frame(Threshold=names(prsice.out[i]), R2=prs.r2[[1]],
P=prs.p, BETA=prs.beta, SE=prs.se))
}
We identify the best-fit PGS by using the R2 value. The distribution of the best-fit PGS is approximately normal, suggesting no obvious problems with the scores (e.g., related individuals or residual stratification).
# Best result is (i.e., the p value threshold at which
# PRS explains the most variance in the outcome):
prs.result_ice[which.max(prs.result_ice$R2),]
# merge genetic and pheno data
df <- left_join(df, prsice.out, by = "IID")
# look at the distribution of best fit PRS
hist(df$prs_0.01)
Note: Genetic principal components need to be controlled for at some point. This can be either during PGS calculation or during analysis (i.e., here).
In this instance, our PGS are not significantly associated with the outcome variable (AD case-control status). This suggests little evidence that common genetic variants are important to AD development (though this is just an example dataset!).
# finally look at the model itself (with the best-fit PRS as predictor)
model1ice <- standardise(glm(formula = PHENO ~ prs_0.01 +
PC1 + PC2 + PC3 + PC4 + PC5 +
SEX, data = df, family = binomial(link = "logit")))
# is the PRS significantly associated with the outcome variable?
summary(model1ice)
##
## Call:
## glm(formula = PHENO ~ prs_0.01 + PC1 + PC2 + PC3 + PC4 + PC5 +
## SEX, family = binomial(link = "logit"), data = data_std)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.61790 0.06398 -40.917 < 2e-16 ***
## prs_0.01 0.07404 0.04502 1.645 0.100
## PC1 -0.83177 0.04180 -19.900 < 2e-16 ***
## PC2 -0.50499 0.03842 -13.143 < 2e-16 ***
## PC3 0.05279 0.04469 1.181 0.238
## PC4 0.06161 0.04426 1.392 0.164
## PC5 -0.21190 0.04476 -4.734 2.2e-06 ***
## SEX -0.95829 0.06006 -15.955 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4380.5 on 5807 degrees of freedom
## Residual deviance: 3380.7 on 5800 degrees of freedom
## AIC: 3396.7
##
## Number of Fisher Scoring iterations: 6
PRScs Practical
Background
PRScs aims to get around some of the other problems inherent to PGS calculation, namely the Winner's Curse. This is the idea that in an auction, the winning bid will be the one that overestimates the item's value. In a GWAS this is also known as over-fitting, and it means that we will likely see exaggerated SNP effect sizes for top hits in the samples that discover those associations; this effect may be exacerbated by testing at multiple thresholds, as occurs in traditional C+T methods.
PRScs uses a continuous shrinkage prior (set with the parameter phi) to try to mitigate this problem. This prior adjusts the SNP effect size estimates given in the sumstats on the basis of their association signal. In this way, the same penalty isn't applied universally across SNPs: shrinkage will be smaller for SNPs with strong association signals and larger for those with higher ('less significant') p values.
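The intuition can be sketched with a toy shrinkage rule (this is NOT the PRScs model, just an illustration of signal-dependent shrinkage): each estimate is pulled toward zero by a factor that depends on how large it is relative to its standard error.

```python
def toy_shrink(beta: float, se: float) -> float:
    """Toy signal-dependent shrinkage (NOT the PRScs prior): scale the
    estimate by beta^2 / (beta^2 + se^2), so estimates that are large
    relative to their standard error are shrunk less."""
    return beta * beta**2 / (beta**2 + se**2)

print(round(toy_shrink(0.50, 0.05), 3))  # 0.495 - strong signal, barely shrunk
print(round(toy_shrink(0.05, 0.05), 3))  # 0.025 - weak signal, halved
```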
Running PRScs gives you the estimated effect sizes with the continuous shrinkage prior applied. You then use these, instead of the effect sizes from your GWAS, to score your participants in the usual way in PLINK.
PRScs has three settings that require alteration: ITER,
BURNIN, and phi.
The first two relate to how the probability distribution (of the 'true' SNP effect sizes) is sampled. PRScs uses MCMC (Markov chain Monte Carlo) to do this: a Markov chain (a sequence of values in which each value depends only on the previous one) is used to perform a Monte Carlo approximation.
The Markov chain starts at a completely random point. At each step, new parameter values are proposed and compared to the current ones; the change is accepted if the new parameters explain the data better (and occasionally even when they don't, which stops the chain getting stuck). The chain continues to update itself in this way, getting closer and closer to the target distribution until it converges.
Figure from https://blog.revolutionanalytics.com/2013/09/an-animated-peek-into-the-workings-of-bayesian-statistics.html
This can be imagined a little like a pirate looking for treasure
(where the treasure represents the probability distribution of the true
effect sizes). Each dash is a step, or an update to the Markov chain
(also known as iterations / ITER).
If you have too few iterations, then you aren't going to allow the pirate to get close to the treasure, or the chain to get close to the true probability distribution. If you have too many, then your pirate will be in roughly the right place but will get tired from walking around too much (i.e., it'll be overly computationally demanding and time consuming, wasting resources). Once the pirate is roughly in the right area and the numbers in the chain aren't changing much, the model is described as converged.
The default setting in PRScs for ITER is 1,000; however, in practice we tend to use a lot more (~25,000).
The second setting is BURNIN. This is the number of initial iterations you ignore. If you asked the pirate to go back and tell her friends where she thinks the treasure is hidden, she wouldn't retrace her steps entirely, but rather go straight to the area she thinks it's in based on her previous expedition. Likewise, the early iterations of the Markov chain are probably not going to be very close to the distribution we are trying to infer, but the later ones will be. Iterations before your BURNIN setting are discarded, and iterations after it are kept.
The default setting in PRScs for BURNIN is 500. As a rule of thumb, you can ignore roughly the first third: if we used an ITER of 25k, then we might use a BURNIN of 10k.
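The ITER/BURNIN logic can be sketched with a toy random-walk Metropolis sampler. This is NOT the PRScs sampler or its effect size model; it just estimates the mean of a simple normal model, with all values chosen for illustration:

```python
import math
import random

def toy_metropolis(target_mean, n_iter=25_000, burnin=10_000, seed=42):
    """Toy random-walk Metropolis sampler for the mean of a normal
    distribution (sd = 1). Illustrates ITER and BURNIN only."""
    random.seed(seed)
    log_lik = lambda m: -0.5 * (m - target_mean) ** 2
    x = random.uniform(-10, 10)  # the chain starts at a random point
    kept = []
    for i in range(n_iter):
        proposal = x + random.gauss(0, 0.5)  # take a random step
        # accept if the proposal fits better (or occasionally even if not)
        if math.log(random.random()) < log_lik(proposal) - log_lik(x):
            x = proposal
        if i >= burnin:  # discard everything before BURNIN
            kept.append(x)
    return kept

samples = toy_metropolis(target_mean=0.3)
print(round(sum(samples) / len(samples), 1))  # close to the true mean of 0.3
```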
In reality, it isn't as neat as this example. The steps taken are random and the chain is memoryless, so things are more likely to look like this instead:
Figure demonstrating convergence of Markov chains.
Here, we have multiple chains converging. The x axis represents the number of ITER, and the dotted line represents BURNIN, or what we ignore. The iterations after this point are going to be close to the true effect size probability distribution, so this is what we want to take forward and use to adjust our SNP effect size estimates.
Practical
The PRScs practical will involve us calculating PGS for an unknown phenotype.
1. Set up file structure and obtain files (same structure as above but repeated as a reminder):
Files can be obtained from
/nfs/neurocluster/databank/Teaching/MET587/PRS on HAWK.
Again, anything struck through isn't required for this bit.
📦 PRS
 ┣ 📂 data
 ┃ ┣ 📜 validation.bim/.bed/.fam
 ┃ ┣ 📜 training.assoc.logistic
 ┃ ┣ 📜 PGC3_sz_sumstats.txt
 ┃ ┣ 📜 high-ld-b37.ranges
 ┃ ┗ 📂 ldblk_1kg_eur
 ┣ 📂 Code
 ┣ 📂 Results
 ┃ ┣ 📂 pca
 ┃ ┣ 📂 prsice
 ┃ ┗ 📂 prscs
 ┣ 📂 PRScs
 ┣ 📂 ve_prscs
 ┗ 📂 code_examples
2. GWAS selection
We cannot use the same sumstats in PRScs as we previously used for the AD PGS practical in PRSice2. This is because those sumstats have been formatted for the practical and lack information on the A2 allele, which is needed for running PRScs.
Instead, we are going to use the phase 3 schizophrenia sumstats from
the PGC. These can be downloaded from here,
however, a formatted version is available in the project directory as
PGC3_sz_sumstats.txt.
3. LD Panel Selection
PRScs does not require you to remove linkage disequilibrium from your data prior to the analysis. This is because PRScs uses a linkage disequilibrium panel to model LD as part of its process. Crucially, the LD panel needs to be of matched ancestry to your samples; for this practical we are using the European 1000 Genomes Project reference panel.
4. Download PRScs
There is a PRScs module on HAWK already; this can be loaded up using
module load prscs. Alternatively, you can clone it to your
account using:
git clone https://github.com/getian107/PRScs.git. For
convenience, it is available in the MET587 folder on HAWK so this can be
copied across with the other materials. Before it works you either need
to activate your virtual environment or anaconda - this is shown in the
script below. Both options are fine, just use whichever you prefer.
5. Calculate Principal Components
We need our PCs here again, so we can just use the same ones we calculated above (1_generate_pcs.sh).
6. Running PRScs
There are several things which differ between using PRSice2 and PRScs:
- We are using the array feature of slurm to run all chromosomes in parallel (see 3_run_prscs.sh). #SBATCH --array=1-22 is added to the header, which loops through the numbers 1 to 22, substituting them in wherever ${SLURM_ARRAY_TASK_ID} appears. This also means we want to split our target data by chromosome (and, after score calculation, sum the per-chromosome scores in R).
- We do not need to prune the target data, as PRScs sorts this "in house" using the LD panel. We only do light QC here instead.
- PRScs does not output PGS, but rather effect sizes. The prior that PRScs uses is the continuous shrinkage prior, meaning that effect size shrinkage is not uniformly applied to SNPs but depends on the strength of the signal in the GWAS. Related to this is the final setting, the global shrinkage parameter phi. This parameter is set based on the phenotype you're studying and the assumptions you make about its underlying genetics. While you can leave it for the program to estimate (when you have a large GWAS sample for a polygenic trait), it's usually best to set it yourself. The PRScs documentation suggests 1e-2 (for highly polygenic traits) or 1e-4 (for less polygenic traits). Or you can play around with a few and see what gives you the best results. For schizophrenia we tend to set it to 1 because it's a very complex polygenic disorder; however, you could set it to auto too (for example, if you're calculating lots of PRS and want to be consistent across your pipelines).
- PRScs has several requirements that are not already present in python (i.e., scipy, h5py), so you can load up anaconda to easily sort this. Alternatively, you can create a python environment and load it each time you want to use PRScs. When it comes time to use PRScs, make sure you have activated the environment (examples of both are shown in 3_run_prscs.sh).
#!/bin/bash
#SBATCH --account=scwXXXX
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -p c_compute_neuro1
#SBATCH --array=1-22
#SBATCH --time=24:00:00
#SBATCH --mem=64G
# initial QC
module load plink/2.0
plink2 --bfile data/validation --geno 0.05 --mind 0.05 --hwe 0.000001 midp keep-fewhet --make-bed --out data/validation.qc
# split validation by chrom
module load plink/1.9
plink --bfile data/validation.qc --chr ${SLURM_ARRAY_TASK_ID} --make-bed --out data/validation_${SLURM_ARRAY_TASK_ID}
module load anaconda/2020.02
source activate
# or
module load python/3.9.2
source ve_prscs/bin/activate
pip install -r ve_prscs/requirements.txt
# run prscs (using really low iter/burnin parameters for speed)
python3 PRScs/PRScs.py \
--ref_dir=data/ldblk_1kg_eur \
--bim_prefix=data/validation_${SLURM_ARRAY_TASK_ID} \
--sst_file=data/PGC3_sz_sumstats.txt \
--n_gwas=320404 \
--chrom=${SLURM_ARRAY_TASK_ID} \
--n_iter=100 --n_burnin=50 \
--out_dir=results/cs
7. Scoring SNPs
Once we have our new effect sizes (with the continuous shrinkage prior applied), we can then score our alleles. We specify our input genotype file as usual in PLINK, along with any filters we want to use (such as MAF or INFO scores); instead of the sumstats file we would normally use, we supply the results from PRScs (specifying the columns from which PLINK should read the variant IDs, the allele codes, and the effect sizes). Finally, we add the 'sum' modifier on the end, which reports sums instead of averages of per-allele scores (averages being the default).
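What the 'sum' scoring computes per individual is just a weighted allele count. A sketch of that arithmetic (the SNP IDs, effect sizes, and dosages here are hypothetical):

```python
# Toy version of what plink's --score ... sum does for one individual:
# PGS = sum over SNPs of (shrunk effect size * count of effect alleles).
effects = {"rs1": 0.021, "rs2": -0.008, "rs3": 0.014}  # PRScs-style effect sizes
dosages = {"rs1": 2, "rs2": 1, "rs3": 0}               # effect-allele counts (0/1/2)

pgs = sum(effects[snp] * dosages[snp] for snp in effects)
print(round(pgs, 3))  # 0.034
```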
# score per chrom
plink \
--bfile data/validation_${SLURM_ARRAY_TASK_ID} \
--score results/cs_pst_eff_a1_b0.5_phiauto_chr${SLURM_ARRAY_TASK_ID}.txt 2 4 6 sum \
--out results/chr${SLURM_ARRAY_TASK_ID}
8. Use PGS in your analyses
Here is an example of how we might use PGS. This code loads in your PGS generated with PRScs and adds up the per-chromosome scores to get per-individual PGS. Then, we use this score to test whether it is associated with the disease phenotype using a logistic regression model.
# set up
library(readr)
library(tidyverse)
library(datawizard)
library(ggplot2)
library(performance)
# load in .profile files (PRScs output - full set provided in HAWK)
# find all .profile files
setwd("results/prscs/")
list_file <- list.files(pattern = "*.profile") %>%
gtools::mixedsort()
# create function to sum across chromosomes
analyze <- function(threshold) {
  # prepare: read in each per-chromosome .profile file
  dat <- lapply(threshold, read.table, header = TRUE, sep = "")
  my_cols <- c("FID", "SCORESUM")
  dat <- lapply(dat, "[", , my_cols)
  # sum scores across chromosomes for each individual
  bind_rows(dat) %>%
    group_by(FID) %>%
    summarise(ALL_CHR_SUM = sum(SCORESUM))
}
# apply function to the list of *.profile files
prs_cs <- analyze(list_file)
# rename columns so nice and clean
colnames(prs_cs) <- c("IID", "PRScs")
# look at distribution of prs
hist(prs_cs$PRScs)
# standardising doesn't change the shape of the distribution, just makes it more meaningful.
# i.e., a change of 20 'points' of PRS doesn't mean anything, but a change of 2 SD does
hist(standardise(prs_cs$PRScs))
# we used schizophrenia sumstats for this practical as the training.assoc.logistic
# file didn't have the right columns in it. the pheno variable in the genotype
# files probably reflects alzheimer's disease case-control status.
# here we are generating a fake variable reflecting schizophrenia case-control
# status, where 1 = case and 0 = control. We have set the probability so that
# there is a higher probability of a case being sampled (59%) than a control (41%).
dfsz <- left_join(df, prs_cs, by = "IID")
# sample case/control status with the probabilities above (0 = control, 1 = case)
dfsz$scz.status <- sample(0:1, nrow(dfsz), prob = c(0.41, 0.59), replace = TRUE)
In this instance, we see no significant association between schizophrenia PGS and case-control status. However, the phenotype is partly made up, so this isn't anything to be concerned about.
# finally look at the model itself (with the best-fit PRS as predictor)
model1cs <- standardise(glm(formula = scz.status ~ PRScs +
PC1 + PC2 + PC3 + PC4 + PC5 + SEX,
data = dfsz, family = binomial(link = "logit")))
# is the PRS significantly associated with the outcome variable?
summary(model1cs)
##
## Call:
## glm(formula = scz.status ~ PRScs + PC1 + PC2 + PC3 + PC4 + PC5 +
## SEX, family = binomial(link = "logit"), data = data_std)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.333799 0.026616 -12.541 <2e-16 ***
## PRScs 0.013620 0.026833 0.508 0.612
## PC1 -0.021462 0.026815 -0.800 0.423
## PC2 -0.008133 0.026627 -0.305 0.760
## PC3 0.005032 0.026630 0.189 0.850
## PC4 0.011931 0.026620 0.448 0.654
## PC5 0.030276 0.026629 1.137 0.256
## SEX -0.006702 0.026739 -0.251 0.802
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7892.2 on 5807 degrees of freedom
## Residual deviance: 7889.4 on 5800 degrees of freedom
## AIC: 7905.4
##
## Number of Fisher Scoring iterations: 4
Cardiff University, locksk@cardiff.ac.uk